import pandas as pd
# Reading the CSV file:
df_apps = pd.read_csv('D:\\Desktop\\Study\\100 Days of Code - The Complete Python Pro Bootcamp for 2021\\DATA\\Android App Store\\apps.csv')
df_apps.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ak Parti Yardım Toplama | SOCIAL | NaN | 0 | 8.7 | 0 | Paid | $13.99 | Teen | Social | July 28, 2017 | 4.1 and up |
| 1 | Ain Arabic Kids Alif Ba ta | FAMILY | NaN | 0 | 33.0 | 0 | Paid | $2.99 | Everyone | Education | April 15, 2016 | 3.0 and up |
| 2 | Popsicle Launcher for Android P 9.0 launcher | PERSONALIZATION | NaN | 0 | 5.5 | 0 | Paid | $1.49 | Everyone | Personalization | July 11, 2018 | 4.2 and up |
| 3 | Command & Conquer: Rivals | FAMILY | NaN | 0 | 19.0 | 0 | NaN | 0 | Everyone 10+ | Strategy | June 28, 2018 | Varies with device |
| 4 | CX Network | BUSINESS | NaN | 0 | 10.0 | 0 | Free | 0 | Everyone | Business | August 6, 2018 | 4.1 and up |
df_apps.tail()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10836 | Subway Surfers | GAME | 4.5 | 27723193 | 76.0 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 4.1 and up |
| 10837 | Subway Surfers | GAME | 4.5 | 27724094 | 76.0 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 4.1 and up |
| 10838 | Subway Surfers | GAME | 4.5 | 27725352 | 76.0 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 4.1 and up |
| 10839 | Subway Surfers | GAME | 4.5 | 27725352 | 76.0 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 4.1 and up |
| 10840 | Subway Surfers | GAME | 4.5 | 27711703 | 76.0 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 4.1 and up |
df_apps.shape
(10841, 12)
The shape tells us that we have 10,841 rows and 12 columns.
We can already see some data issues that we need to fix. The Rating and Type columns contain NaN (Not a Number) values, and the Price column contains dollar signs, which will cause problems when we try to treat prices as numbers.
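One way to quantify these issues is to count the missing values per column and to try converting Price to a number. The sketch below uses a tiny made-up DataFrame (not the real apps.csv) that borrows the column names; the price conversion is only one possible approach and is an assumption, not necessarily what this notebook does later.

```python
import pandas as pd

# Tiny made-up sample that mimics the problems seen in apps.csv
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [None, 4.5, None, 3.9],
    "Type": ["Paid", None, "Free", "Free"],
    "Price": ["$13.99", "0", "0", "$1.49"],
})

# Number of missing values in each column
nan_counts = toy.isna().sum()

# One possible fix for Price: strip the '$' and convert the text to numbers
price_numeric = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False))
```

Passing regex=False treats the dollar sign as a literal character rather than as a regular-expression anchor.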
The .sample(n) method will give us n random rows. This is another handy way to inspect our DataFrame.
df_apps.sample(5)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2230 | DX Alert | TOOLS | 4.1 | 27 | 2.5 | 1,000 | Free | 0 | Everyone | Tools | April 28, 2016 | 3.2 and up |
| 2623 | EV Connect | MAPS_AND_NAVIGATION | 2.8 | 25 | 5.8 | 1,000 | Free | 0 | Everyone | Maps & Navigation | August 6, 2018 | 5.0 and up |
| 6621 | Augment - 3D Augmented Reality | BUSINESS | 4.1 | 25195 | 21.0 | 1,000,000 | Free | 0 | Everyone | Business | February 26, 2018 | 4.0.3 and up |
| 10758 | Flipboard: News For Our Time | NEWS_AND_MAGAZINES | 4.4 | 1284017 | 6.3 | 500,000,000 | Free | 0 | Everyone 10+ | News & Magazines | August 3, 2018 | Varies with device |
| 10709 | Farm Heroes Saga | FAMILY | 4.4 | 7615646 | 71.0 | 100,000,000 | Free | 0 | Everyone | Casual | August 7, 2018 | 2.3 and up |
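Because .sample() is random, every run shows different rows. If a reproducible sample is wanted, a seed can be passed via random_state. A minimal sketch on made-up data (the seed value 42 is arbitrary):

```python
import pandas as pd

# Made-up data, only to demonstrate the behaviour of .sample()
toy = pd.DataFrame({
    "App": list("ABCDEFGH"),
    "Rating": [4.1, 2.8, 4.1, 4.4, 4.4, 3.9, 4.7, 4.0],
})

# The same seed returns the same rows on every call
first = toy.sample(n=3, random_state=42)
second = toy.sample(n=3, random_state=42)
```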
To remove the unwanted columns, we simply provide a list of the column names ['Last_Updated', 'Android_Ver'] to the .drop() method. Setting axis=1 specifies that we want to drop columns rather than rows.
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)
df_apps.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ak Parti Yardım Toplama | SOCIAL | NaN | 0 | 8.7 | 0 | Paid | $13.99 | Teen | Social |
| 1 | Ain Arabic Kids Alif Ba ta | FAMILY | NaN | 0 | 33.0 | 0 | Paid | $2.99 | Everyone | Education |
| 2 | Popsicle Launcher for Android P 9.0 launcher | PERSONALIZATION | NaN | 0 | 5.5 | 0 | Paid | $1.49 | Everyone | Personalization |
| 3 | Command & Conquer: Rivals | FAMILY | NaN | 0 | 19.0 | 0 | NaN | 0 | Everyone 10+ | Strategy |
| 4 | CX Network | BUSINESS | NaN | 0 | 10.0 | 0 | Free | 0 | Everyone | Business |
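Note that inplace=True modifies df_apps directly, so running the same cell a second time would raise a KeyError because the columns are already gone. A possible alternative, sketched here on made-up data, is to use the columns= keyword and keep the result in a new variable:

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Price": ["0", "$2.99"],
    "Last_Updated": ["July 28, 2017", "April 15, 2016"],
    "Android_Ver": ["4.1 and up", "3.0 and up"],
})

# columns= is equivalent to passing the labels together with axis=1
trimmed = toy.drop(columns=["Last_Updated", "Android_Ver"])

# errors="ignore" makes the call safe to repeat once the columns are gone
trimmed_again = trimmed.drop(columns=["Last_Updated", "Android_Ver"], errors="ignore")
```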
To find and remove the rows with NaN values, we can create a subset of the DataFrame based on where .isna() evaluates to True. We see that NaN values in the Rating column are associated with no reviews (and no installs). That makes sense.
nan_rows = df_apps[df_apps.Rating.isna()]
print(nan_rows.shape)
nan_rows.head()
(1474, 10)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ak Parti Yardım Toplama | SOCIAL | NaN | 0 | 8.7 | 0 | Paid | $13.99 | Teen | Social |
| 1 | Ain Arabic Kids Alif Ba ta | FAMILY | NaN | 0 | 33.0 | 0 | Paid | $2.99 | Everyone | Education |
| 2 | Popsicle Launcher for Android P 9.0 launcher | PERSONALIZATION | NaN | 0 | 5.5 | 0 | Paid | $1.49 | Everyone | Personalization |
| 3 | Command & Conquer: Rivals | FAMILY | NaN | 0 | 19.0 | 0 | NaN | 0 | Everyone 10+ | Strategy |
| 4 | CX Network | BUSINESS | NaN | 0 | 10.0 | 0 | Free | 0 | Everyone | Business |
df_apps_clean = df_apps.dropna()
df_apps_clean.shape
(9367, 10)
This leaves us with 9,367 entries in our DataFrame. But there may be other problems with the data too.
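Keep in mind that .dropna() without arguments removes a row if any column is NaN, not only Rating. If only missing ratings should be removed, the subset parameter can restrict the check. A small sketch on made-up data:

```python
import pandas as pd

# Made-up rows: A and D lack a rating, B lacks a type
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [None, 4.5, 4.0, None],
    "Type": ["Paid", None, "Free", "Free"],
})

any_nan_dropped = toy.dropna()                      # drops A, B and D
rating_nan_dropped = toy.dropna(subset=["Rating"])  # drops A and D only
```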
Are there any duplicates in the data? Check for duplicates using the .duplicated() method.
df_apps_duplicated = df_apps_clean[df_apps_clean.duplicated()]
print(df_apps_duplicated.shape)
df_apps_duplicated.head()
(476, 10)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 946 | 420 BZ Budeze Delivery | MEDICAL | 5.0 | 2 | 11.0 | 100 | Free | 0 | Mature 17+ | Medical |
| 1133 | MouseMingle | DATING | 2.7 | 3 | 3.9 | 100 | Free | 0 | Mature 17+ | Dating |
| 1196 | Cardiac diagnosis (heart rate, arrhythmia) | MEDICAL | 4.4 | 8 | 6.5 | 100 | Paid | $12.99 | Everyone | Medical |
| 1231 | Sway Medical | MEDICAL | 5.0 | 3 | 22.0 | 100 | Free | 0 | Everyone | Medical |
| 1247 | Chat Kids - Chat Room For Kids | DATING | 4.7 | 6 | 4.9 | 100 | Free | 0 | Mature 17+ | Dating |
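By default, .duplicated() marks only the second and later copies of a row, so the first occurrence of each duplicate does not appear in the subset above. Passing keep=False marks every copy. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the first two are exact copies of each other
toy = pd.DataFrame({
    "App": ["Sway Medical", "Sway Medical", "MouseMingle", "DX Alert"],
    "Rating": [5.0, 5.0, 2.7, 4.1],
})

later_copies = toy[toy.duplicated()]          # default keep="first"
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin
```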
How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.
We can actually check for an individual app like ‘Instagram’ by looking up all the entries with that name in the App column.
df_apps_clean[df_apps_clean.App == "Instagram"]
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 10806 | Instagram | SOCIAL | 4.5 | 66577313 | 5.3 | 1,000,000,000 | Free | 0 | Teen | Social |
| 10808 | Instagram | SOCIAL | 4.5 | 66577446 | 5.3 | 1,000,000,000 | Free | 0 | Teen | Social |
| 10809 | Instagram | SOCIAL | 4.5 | 66577313 | 5.3 | 1,000,000,000 | Free | 0 | Teen | Social |
| 10810 | Instagram | SOCIAL | 4.5 | 66509917 | 5.3 | 1,000,000,000 | Free | 0 | Teen | Social |
So how do we get rid of duplicates? Can we simply call .drop_duplicates()?
# df_apps_clean = df_apps_clean.drop_duplicates()
# Not really. If we do this without specifying how to identify duplicates, 3 copies of Instagram
# are retained because they have different numbers of reviews.
# We need to provide the column names that should be used in the comparison to identify duplicates:
df_apps_clean = df_apps_clean.drop_duplicates(subset = ["App", "Type", "Price"])
df_apps_clean[df_apps_clean.App == "Instagram"]
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 10806 | Instagram | SOCIAL | 4.5 | 66577313 | 5.3 | 1,000,000,000 | Free | 0 | Teen | Social |
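The difference between the two calls can be reproduced on a small example. The sketch below rebuilds only the four Instagram rows shown earlier (review counts copied from that output, other columns omitted) and compares a full-row comparison with the subset comparison:

```python
import pandas as pd

# The four Instagram entries, reduced to the columns that matter here
toy = pd.DataFrame({
    "App": ["Instagram"] * 4,
    "Reviews": [66577313, 66577446, 66577313, 66509917],
    "Type": ["Free"] * 4,
    "Price": ["0"] * 4,
})

full_row = toy.drop_duplicates()                               # compares every column
by_key = toy.drop_duplicates(subset=["App", "Type", "Price"])  # compares three columns
```

With the default keep="first", the surviving row is whichever copy comes first in the DataFrame, so the review count that is kept depends on row order.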
So we can see that 13 different features were originally scraped from the Google Play Store.
Obviously, the data is just a sample; it does not include all Android apps, of which there are millions.
I’ll assume that the sample is representative of the App Store as a whole. This is not necessarily the case as, during the web scraping process, this sample was served up based on geographical location and user behaviour of the person who scraped it - in our case Lavanya Gupta.
The data was compiled around 2017/2018. The pricing data reflects the price in US dollars at the time of scraping (developers can offer promotions and change their app’s pricing).
I’ve converted the app’s size to a floating-point number in MBs. If data was missing, it has been replaced by the average size for that category.
The installs are not the exact number of installs. If an app has 245,239 installs then Google will simply report an order of magnitude like 100,000+. I’ve removed the '+' and we’ll assume the exact number of installs in that column for simplicity.
Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?
df_apps_clean.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | $1.49 | Everyone | Arcade |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | $0.99 | Everyone | Arcade |
| 82 | Brick Breaker BR | GAME | 5.0 | 7 | 19.0 | 5 | Free | 0 | Everyone | Arcade |
| 99 | Anatomy & Physiology Vocabulary Exam Review App | MEDICAL | 5.0 | 1 | 4.6 | 5 | Free | 0 | Everyone | Medical |
df_apps_clean.Rating.max()
5.0
df_apps_highest_rated = df_apps_clean[df_apps_clean.Rating == 5.0]
df_apps_highest_rated.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | $1.49 | Everyone | Arcade |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | $0.99 | Everyone | Arcade |
| 82 | Brick Breaker BR | GAME | 5.0 | 7 | 19.0 | 5 | Free | 0 | Everyone | Arcade |
| 99 | Anatomy & Physiology Vocabulary Exam Review App | MEDICAL | 5.0 | 1 | 4.6 | 5 | Free | 0 | Everyone | Medical |
df_apps_highest_rated = df_apps_clean.sort_values('Rating', ascending=False)
df_apps_highest_rated.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 1230 | Sway Medical | MEDICAL | 5.0 | 3 | 22.0 | 100 | Free | 0 | Everyone | Medical |
| 1227 | AJ Men's Grooming | LIFESTYLE | 5.0 | 2 | 22.0 | 100 | Free | 0 | Everyone | Lifestyle |
| 1224 | FK Dedinje BGD | SPORTS | 5.0 | 36 | 2.6 | 100 | Free | 0 | Everyone | Sports |
| 1223 | CB VIDEO VISION | PHOTOGRAPHY | 5.0 | 13 | 2.6 | 100 | Free | 0 | Everyone | Photography |
Only apps with very few reviews (and a low number of installs) have perfect 5-star ratings (most likely from friends and family).
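One possible way to make the ranking more meaningful is to require a minimum number of reviews before sorting by rating. The sketch below uses four rows copied from the outputs in this notebook; the cut-off MIN_REVIEWS is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Four rows taken from the tables in this notebook
toy = pd.DataFrame({
    "App": ["KBA-EZ Health Guide", "Ra Ga Ba", "Clash of Clans", "WhatsApp Messenger"],
    "Rating": [5.0, 5.0, 4.6, 4.4],
    "Reviews": [4, 2, 44891723, 69119316],
})

MIN_REVIEWS = 1000  # hypothetical cut-off, not taken from the source

# Keep only apps with enough reviews, then rank by rating
well_reviewed = toy[toy.Reviews >= MIN_REVIEWS]
top_rated = well_reviewed.sort_values("Rating", ascending=False)
```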
What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?
df_apps_clean.Size_MBs.max()
100.0
# Another method to find apps with large size:
df_apps_clean.sort_values('Size_MBs', ascending=False).head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 9942 | Talking Babsy Baby: Baby Games | LIFESTYLE | 4.0 | 140995 | 100.0 | 10,000,000 | Free | 0 | Everyone | Lifestyle;Pretend Play |
| 10687 | Hungry Shark Evolution | GAME | 4.5 | 6074334 | 100.0 | 100,000,000 | Free | 0 | Teen | Arcade |
| 9943 | Miami crime simulator | GAME | 4.0 | 254518 | 100.0 | 10,000,000 | Free | 0 | Mature 17+ | Action |
| 9944 | Gangster Town: Vice District | FAMILY | 4.3 | 65146 | 100.0 | 10,000,000 | Free | 0 | Mature 17+ | Simulation |
| 3144 | Vi Trainer | HEALTH_AND_FITNESS | 3.6 | 124 | 100.0 | 5,000 | Free | 0 | Everyone | Health & Fitness |
Here we can clearly see that there seems to be an upper bound of 100 MB for the size of an app. A quick Google search would also have revealed that this limit is imposed by the Google Play Store itself. It’s interesting to see that a number of apps actually hit that limit exactly.
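To see how many apps sit exactly at the maximum, the Size_MBs column can be compared against its own maximum. A minimal sketch with four rows whose sizes are copied from the tables in this notebook:

```python
import pandas as pd

# Sizes copied from the outputs in this notebook
toy = pd.DataFrame({
    "App": ["Hungry Shark Evolution", "Vi Trainer", "Clash of Clans", "Subway Surfers"],
    "Size_MBs": [100.0, 100.0, 98.0, 76.0],
})

max_size = toy.Size_MBs.max()
at_limit = toy[toy.Size_MBs == max_size]  # rows that hit the maximum exactly
n_at_limit = len(at_limit)
```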
Which apps have the highest number of reviews? Are there any paid apps among the top 50?
df_apps_clean.sort_values('Reviews', ascending=False).head(50)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 10805 | Facebook | SOCIAL | 4.1 | 78158306 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social |
| 10785 | WhatsApp Messenger | COMMUNICATION | 4.4 | 69119316 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10806 | Instagram | SOCIAL | 4.5 | 66577313 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social |
| 10784 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.0 | 56642847 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10650 | Clash of Clans | GAME | 4.6 | 44891723 | 98.00 | 100,000,000 | Free | 0 | Everyone 10+ | Strategy |
| 10744 | Clean Master- Space Cleaner & Antivirus | TOOLS | 4.7 | 42916526 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10835 | Subway Surfers | GAME | 4.5 | 27722264 | 76.00 | 1,000,000,000 | Free | 0 | Everyone 10+ | Arcade |
| 10828 | YouTube | VIDEO_PLAYERS | 4.3 | 25655305 | 4.65 | 1,000,000,000 | Free | 0 | Teen | Video Players & Editors |
| 10746 | Security Master - Antivirus, VPN, AppLock, Boo... | TOOLS | 4.7 | 24900999 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10584 | Clash Royale | GAME | 4.6 | 23133508 | 97.00 | 100,000,000 | Free | 0 | Everyone 10+ | Strategy |
| 10763 | Candy Crush Saga | GAME | 4.4 | 22426677 | 74.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10770 | UC Browser - Fast Download Private & Secure | COMMUNICATION | 4.5 | 17712922 | 40.00 | 500,000,000 | Free | 0 | Teen | Communication |
| 10735 | Snapchat | SOCIAL | 4.0 | 17014787 | 5.30 | 500,000,000 | Free | 0 | Teen | Social |
| 10489 | 360 Security - Free Antivirus, Booster, Cleaner | TOOLS | 4.6 | 16771865 | 3.40 | 100,000,000 | Free | 0 | Everyone | Tools |
| 10731 | My Talking Tom | GAME | 4.5 | 14891223 | 36.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10594 | 8 Ball Pool | GAME | 4.5 | 14198297 | 52.00 | 100,000,000 | Free | 0 | Everyone | Sports |
| 10302 | DU Battery Saver - Battery Charger & Battery Life | TOOLS | 4.5 | 13479633 | 14.00 | 100,000,000 | Free | 0 | Everyone | Tools |
| 10354 | BBM - Free Calls & Messages | COMMUNICATION | 4.3 | 12842860 | 3.50 | 100,000,000 | Free | 0 | Everyone | Communication |
| 10549 | Cache Cleaner-DU Speed Booster (booster & clea... | TOOLS | 4.5 | 12759663 | 15.00 | 100,000,000 | Free | 0 | Everyone | Tools |
| 10757 | | NEWS_AND_MAGAZINES | 4.3 | 11667403 | 6.30 | 500,000,000 | Free | 0 | Mature 17+ | News & Magazines |
| 10721 | Viber Messenger | COMMUNICATION | 4.3 | 11334799 | 3.50 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10578 | Shadow Fight 2 | GAME | 4.6 | 10979062 | 88.00 | 100,000,000 | Free | 0 | Everyone 10+ | Action |
| 10813 | Google Photos | PHOTOGRAPHY | 4.5 | 10858556 | 6.90 | 1,000,000,000 | Free | 0 | Everyone | Photography |
| 10724 | LINE: Free Calls & Messages | COMMUNICATION | 4.2 | 10790289 | 3.50 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10717 | Pou | GAME | 4.3 | 10485308 | 24.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10792 | Skype - free IM & video calls | COMMUNICATION | 4.1 | 10484169 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10628 | Pokémon GO | GAME | 4.1 | 10424925 | 85.00 | 100,000,000 | Free | 0 | Everyone | Adventure |
| 10388 | Minion Rush: Despicable Me Official Game | GAME | 4.5 | 10216538 | 36.00 | 100,000,000 | Free | 0 | Everyone 10+ | Casual;Action & Adventure |
| 10694 | Yes day | GAME | 4.5 | 10055521 | 94.00 | 100,000,000 | Free | 0 | Everyone | Casual |
| 10695 | Hay Day | FAMILY | 4.5 | 10053186 | 94.00 | 100,000,000 | Free | 0 | Everyone | Casual |
| 10644 | Dream League Soccer 2018 | GAME | 4.6 | 9882639 | 74.00 | 100,000,000 | Free | 0 | Everyone | Sports |
| 10696 | My Talking Angela | GAME | 4.5 | 9881829 | 99.00 | 100,000,000 | Free | 0 | Everyone | Casual |
| 10660 | VivaVideo - Video Editor & Photo Movie | VIDEO_PLAYERS | 4.6 | 9879473 | 40.00 | 100,000,000 | Free | 0 | Teen | Video Players & Editors |
| 10786 | Google Chrome: Fast & Secure | COMMUNICATION | 4.3 | 9642995 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10817 | Maps - Navigate & Explore | TRAVEL_AND_LOCAL | 4.3 | 9235155 | 12.00 | 1,000,000,000 | Free | 0 | Everyone | Travel & Local |
| 10672 | Hill Climb Racing | GAME | 4.4 | 8923587 | 63.00 | 100,000,000 | Free | 0 | Everyone | Racing |
| 10734 | Facebook Lite | SOCIAL | 4.3 | 8606259 | 5.30 | 500,000,000 | Free | 0 | Teen | Social |
| 10649 | Asphalt 8: Airborne | GAME | 4.5 | 8389714 | 92.00 | 100,000,000 | Free | 0 | Teen | Racing |
| 10699 | Mobile Legends: Bang Bang | GAME | 4.4 | 8219586 | 99.00 | 100,000,000 | Free | 0 | Teen | Action |
| 10322 | Battery Doctor-Battery Life Saver & Battery Co... | TOOLS | 4.5 | 8190074 | 17.00 | 100,000,000 | Free | 0 | Everyone | Tools |
| 10396 | Piano Tiles 2™ | GAME | 4.7 | 8118880 | 36.00 | 100,000,000 | Free | 0 | Everyone | Arcade |
| 10777 | Temple Run 2 | GAME | 4.3 | 8118609 | 62.00 | 500,000,000 | Free | 0 | Everyone | Action |
| 10822 | | TOOLS | 4.4 | 8033493 | 3.40 | 1,000,000,000 | Free | 0 | Everyone | Tools |
| 10359 | Truecaller: Caller ID, SMS spam blocking & Dialer | COMMUNICATION | 4.5 | 7820209 | 3.50 | 100,000,000 | Free | 0 | Everyone | Communication |
| 10711 | SHAREit - Transfer & Share | TOOLS | 4.6 | 7790693 | 17.00 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10389 | Sniper 3D Gun Shooter: Free Shooting Games - FPS | GAME | 4.6 | 7671249 | 36.00 | 100,000,000 | Free | 0 | Mature 17+ | Action |
| 10676 | Farm Heroes Saga | GAME | 4.4 | 7614130 | 70.00 | 100,000,000 | Free | 0 | Everyone | Casual |
| 10576 | PicsArt Photo Studio: Collage Maker & Pic Editor | PHOTOGRAPHY | 4.5 | 7594559 | 34.00 | 100,000,000 | Free | 0 | Teen | Photography |
| 10461 | PhotoGrid: Video & Pic Collage Maker, Photo Ed... | PHOTOGRAPHY | 4.6 | 7529865 | 6.90 | 100,000,000 | Free | 0 | Everyone | Photography |
| 10502 | GO Launcher - 3D parallax Themes & HD Wallpapers | PERSONALIZATION | 4.5 | 7464996 | 6.15 | 100,000,000 | Free | 0 | Everyone | Personalization |
If you look at the number of reviews, you can find the most popular apps on the Android App Store. These include the usual suspects: Facebook, WhatsApp, Instagram etc. What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app!
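Rather than scanning 50 rows by eye, the paid apps in the top group can be counted directly. The sketch below uses five rows copied from the tables in this notebook and a smaller TOP_N so the example stays short; .nlargest() is used here as an alternative to sort_values(...).head().

```python
import pandas as pd

# Five rows taken from the outputs in this notebook
toy = pd.DataFrame({
    "App": ["WhatsApp Messenger", "Clash of Clans", "Weather Live", "bpresso PRO", "YouTube"],
    "Reviews": [69119316, 44891723, 76593, 515, 25655305],
    "Type": ["Free", "Free", "Paid", "Paid", "Free"],
})

TOP_N = 3  # the notebook uses 50; 3 keeps the toy example small

top = toy.nlargest(TOP_N, "Reviews")      # rows with the most reviews
n_paid = int((top.Type == "Paid").sum())  # how many of them are paid
```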
All Android apps have a content rating like “Everyone” or “Teen” or “Mature 17+”. Let’s take a look at the distribution of the content ratings in our dataset and see how to visualise it with plotly - a popular data visualisation library that you can use alongside or instead of Matplotlib.
# First, we’ll count the number of occurrences of each rating with .value_counts():
ratings = df_apps_clean.Content_Rating.value_counts()
ratings
Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64
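Since a pie chart shows shares rather than raw counts, it can help to look at the percentages first. The sketch below rebuilds the counts from the output above as a Series and converts them to percentages:

```python
import pandas as pd

# Counts copied from the value_counts() output above
ratings = pd.Series(
    [6621, 912, 357, 305, 3, 1],
    index=["Everyone", "Teen", "Mature 17+", "Everyone 10+", "Adults only 18+", "Unrated"],
    name="Content_Rating",
)

# Share of each content rating in percent, rounded to one decimal place
share_pct = (ratings / ratings.sum() * 100).round(1)
```

On the real DataFrame, df_apps_clean.Content_Rating.value_counts(normalize=True) should give the same proportions directly as fractions.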
The first step in creating charts with plotly is to import plotly.express. This is the fastest way to create a beautiful graphic with a minimal amount of code in plotly.
import pandas as pd
import plotly.express as px
To create a pie chart we simply call px.pie() and then .show() the resulting figure. Plotly refers to all their figures, be they line charts, bar charts, or pie charts as graph_objects.
fig = px.pie(labels = ratings.index, values = ratings.values)
fig.show()
Let’s customise our pie chart. Looking at the .pie() documentation we see a number of parameters that we can set, like title or names. https://plotly.com/python-api-reference/generated/plotly.express.pie.html
plotly.express.pie(data_frame=None, names=None, values=None, color=None, color_discrete_sequence=None, color_discrete_map=None, hover_name=None, hover_data=None, custom_data=None, labels=None, title=None, template=None, width=None, height=None, opacity=None, hole=None)
If you’d like to configure other aspects of the chart, that you can’t see in the list of parameters, you can call a method called .update_traces(). In plotly lingo, “traces” refer to graphical marks on a figure. Think of “traces” as collections of attributes. Here we update the traces to change how the text is displayed.
fig = px.pie(labels = ratings.index, values = ratings.values, title = "Content Rating", names = ratings.index)
fig.update_traces(textposition ='outside', textinfo ='percent+label')
fig.show()
To create a donut 🍩 chart, we can simply add a value for the hole argument:
fig = px.pie(labels=ratings.index,
             values=ratings.values,
             title="Content Rating",
             names=ratings.index,
             hole=0.6,
             )
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()
df_apps_clean.head()
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | $1.49 | Everyone | Arcade |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | $0.99 | Everyone | Arcade |
| 82 | Brick Breaker BR | GAME | 5.0 | 7 | 19.0 | 5 | Free | 0 | Everyone | Arcade |
| 99 | Anatomy & Physiology Vocabulary Exam Review App | MEDICAL | 5.0 | 1 | 4.6 | 5 | Free | 0 | Everyone | Medical |
How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?
df_apps_clean.sort_values('Installs', ascending=False).head(50)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 10731 | My Talking Tom | GAME | 4.5 | 14891223 | 36.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10746 | Security Master - Antivirus, VPN, AppLock, Boo... | TOOLS | 4.7 | 24900999 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10711 | SHAREit - Transfer & Share | TOOLS | 4.6 | 7790693 | 17.00 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10713 | imo free video calls and chat | COMMUNICATION | 4.3 | 4785892 | 11.00 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10717 | Pou | GAME | 4.3 | 10485308 | 24.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10721 | Viber Messenger | COMMUNICATION | 4.3 | 11334799 | 3.50 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10722 | Google Duo - High Quality Video Calls | COMMUNICATION | 4.6 | 2083237 | 3.50 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10724 | LINE: Free Calls & Messages | COMMUNICATION | 4.2 | 10790289 | 3.50 | 500,000,000 | Free | 0 | Everyone | Communication |
| 10734 | Facebook Lite | SOCIAL | 4.3 | 8606259 | 5.30 | 500,000,000 | Free | 0 | Teen | Social |
| 10735 | Snapchat | SOCIAL | 4.0 | 17014787 | 5.30 | 500,000,000 | Free | 0 | Teen | Social |
| 10740 | Google Translate | TOOLS | 4.4 | 5745093 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10744 | Clean Master- Space Cleaner & Antivirus | TOOLS | 4.7 | 42916526 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10741 | Gboard - the Google Keyboard | TOOLS | 4.2 | 1859115 | 3.40 | 500,000,000 | Free | 0 | Everyone | Tools |
| 10747 | Microsoft Word | PRODUCTIVITY | 4.5 | 2084126 | 4.00 | 500,000,000 | Free | 0 | Everyone | Productivity |
| 10758 | Flipboard: News For Our Time | NEWS_AND_MAGAZINES | 4.4 | 1284017 | 6.30 | 500,000,000 | Free | 0 | Everyone 10+ | News & Magazines |
| 10777 | Temple Run 2 | GAME | 4.3 | 8118609 | 62.00 | 500,000,000 | Free | 0 | Everyone | Action |
| 10776 | Samsung Health | HEALTH_AND_FITNESS | 4.3 | 480208 | 70.00 | 500,000,000 | Free | 0 | Everyone | Health & Fitness |
| 10773 | Dropbox | PRODUCTIVITY | 4.4 | 1861310 | 61.00 | 500,000,000 | Free | 0 | Everyone | Productivity |
| 10770 | UC Browser - Fast Download Private & Secure | COMMUNICATION | 4.5 | 17712922 | 40.00 | 500,000,000 | Free | 0 | Teen | Communication |
| 10748 | Google Calendar | PRODUCTIVITY | 4.2 | 858208 | 4.00 | 500,000,000 | Free | 0 | Everyone | Productivity |
| 10763 | Candy Crush Saga | GAME | 4.4 | 22426677 | 74.00 | 500,000,000 | Free | 0 | Everyone | Casual |
| 10757 | | NEWS_AND_MAGAZINES | 4.3 | 11667403 | 6.30 | 500,000,000 | Free | 0 | Mature 17+ | News & Magazines |
| 10754 | MX Player | VIDEO_PLAYERS | 4.5 | 6474426 | 4.65 | 500,000,000 | Free | 0 | Everyone | Video Players & Editors |
| 10752 | Cloud Print | PRODUCTIVITY | 4.1 | 282460 | 4.00 | 500,000,000 | Free | 0 | Everyone | Productivity |
| 6075 | WICShopper | SHOPPING | 3.9 | 3023 | 9.30 | 500,000 | Free | 0 | Everyone | Shopping |
| 6070 | Mayo Clinic | MEDICAL | 4.3 | 2218 | 11.00 | 500,000 | Free | 0 | Everyone | Medical |
| 6074 | Puffin for Facebook | SOCIAL | 4.0 | 10743 | 5.30 | 500,000 | Free | 0 | Teen | Social |
| 6073 | Stream - Live Video Community | SOCIAL | 4.1 | 6388 | 5.30 | 500,000 | Free | 0 | Teen | Social |
| 6072 | Bloglovin' | SOCIAL | 3.9 | 8936 | 5.30 | 500,000 | Free | 0 | Everyone | Social |
| 6071 | Blood Pressure Log - bpresso.com | MEDICAL | 4.2 | 5661 | 11.00 | 500,000 | Free | 0 | Everyone | Medical |
| 5906 | Mosaic puzzles | FAMILY | 4.4 | 1595 | 14.00 | 500,000 | Free | 0 | Everyone | Puzzle;Brain Games |
| 6069 | Dolphin and fish coloring book | FAMILY | 3.9 | 2249 | 19.00 | 500,000 | Free | 0 | Everyone | Art & Design;Creativity |
| 6068 | Messenger Kids – Safer Messaging and Video Chat | FAMILY | 4.2 | 3478 | 19.00 | 500,000 | Free | 0 | Everyone | Communication;Creativity |
| 6066 | Battle Gems (AdventureQuest) | FAMILY | 4.4 | 48427 | 19.00 | 500,000 | Free | 0 | Everyone 10+ | Puzzle |
| 6065 | SuperBikers 2 | GAME | 3.9 | 6200 | 36.00 | 500,000 | Free | 0 | Everyone 10+ | Racing |
| 6064 | AE Spider Solitaire | GAME | 4.5 | 17263 | 36.00 | 500,000 | Free | 0 | Everyone | Card |
| 6077 | Trip by Skyscanner - City & Travel Guide | TRAVEL_AND_LOCAL | 4.1 | 5150 | 12.00 | 500,000 | Free | 0 | Everyone | Travel & Local |
| 6076 | AE Archer | SPORTS | 3.9 | 8638 | 14.00 | 500,000 | Free | 0 | Everyone | Sports |
| 6090 | Beach Shoot Em Up: Head Hunter | GAME | 4.4 | 1218 | 15.00 | 500,000 | Free | 0 | Everyone 10+ | Action |
| 6078 | Navmii GPS USA (Navfree) | TRAVEL_AND_LOCAL | 4.0 | 5960 | 12.00 | 500,000 | Free | 0 | Everyone | Travel & Local |
| 6079 | Planning Center Services | PRODUCTIVITY | 4.3 | 5157 | 4.00 | 500,000 | Free | 0 | Everyone | Productivity |
| 6080 | Official QR Code® Reader "Q" | PRODUCTIVITY | 4.4 | 3031 | 4.00 | 500,000 | Free | 0 | Everyone | Productivity |
| 6081 | Family GPS Tracker and Chat + Baby Monitor Online | PARENTING | 4.4 | 9073 | 10.05 | 500,000 | Free | 0 | Everyone | Parenting |
| 6082 | Weather Live | WEATHER | 4.5 | 76593 | 4.75 | 500,000 | Paid | $5.99 | Everyone | Weather |
| 6083 | How To Color Disney Princess - Coloring Pages | ART_AND_DESIGN | 4.0 | 591 | 9.40 | 500,000 | Free | 0 | Everyone | Art & Design |
| 6084 | Barcode Scanner | LIBRARIES_AND_DEMO | 4.2 | 3945 | 9.40 | 500,000 | Free | 0 | Everyone | Libraries & Demo |
| 6085 | Profile pictures for WhatsApp | PERSONALIZATION | 4.7 | 10786 | 9.40 | 500,000 | Free | 0 | Mature 17+ | Personalization |
| 6086 | Supervision service | AUTO_AND_VEHICLES | 4.0 | 2155 | 15.00 | 500,000 | Free | 0 | Everyone | Auto & Vehicles |
| 6087 | TryDate - Free Online Dating App, Chat Meet Ad... | DATING | 4.4 | 7888 | 15.00 | 500,000 | Free | 0 | Mature 17+ | Dating |
| 6088 | Simple Recipes | FOOD_AND_DRINK | 4.7 | 3803 | 15.00 | 500,000 | Free | 0 | Everyone | Food & Drink |
df_apps_clean.sort_values('Installs', ascending=True).head(50)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | $1.49 | Everyone | Arcade |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | $0.99 | Everyone | Arcade |
| 2136 | GKPB FP Online Church | LIFESTYLE | 5.0 | 32 | 7.9 | 1,000 | Free | 0 | Everyone | Lifestyle |
| 2134 | SCI-Ex | HEALTH_AND_FITNESS | 3.5 | 8 | 7.9 | 1,000 | Free | 0 | Everyone | Health & Fitness |
| 2133 | Sci Fi Sounds | FAMILY | 3.2 | 4 | 8.0 | 1,000 | Free | 0 | Everyone | Entertainment |
| 2132 | DH UFO | FAMILY | 3.0 | 4 | 8.0 | 1,000 | Free | 0 | Everyone | Entertainment |
| 2131 | Discovery Church Florida | LIFESTYLE | 4.7 | 40 | 8.0 | 1,000 | Free | 0 | Teen | Lifestyle |
| 2130 | Ex Service Taxis | BUSINESS | 3.5 | 15 | 8.0 | 1,000 | Free | 0 | Everyone | Business |
| 2129 | DF@realtime | BUSINESS | 4.3 | 24 | 8.0 | 1,000 | Free | 0 | Everyone | Business |
| 2127 | Puck AI Personal Assistant Robot | PRODUCTIVITY | 4.1 | 22 | 26.0 | 1,000 | Free | 0 | Everyone | Productivity |
| 2125 | BP Journal - Blood Pressure Diary | MEDICAL | 5.0 | 6 | 26.0 | 1,000 | Free | 0 | Everyone | Medical |
| 2124 | FD VR Music Videos - MTV Pop and Rap in 360 | FAMILY | 4.9 | 15 | 26.0 | 1,000 | Free | 0 | Everyone | Entertainment |
| 2123 | BJ-FPV | FAMILY | 4.2 | 16 | 26.0 | 1,000 | Free | 0 | Everyone | Casual |
| 2119 | NewTek NDI | PHOTOGRAPHY | 3.5 | 77 | 1.2 | 1,000 | Paid | $19.99 | Everyone | Photography |
| 2118 | Go Go Coupons - Free Coupon and Discount | SHOPPING | 3.0 | 4 | 1.2 | 1,000 | Free | 0 | Everyone | Shopping |
| 2117 | OmniMedix Medical Calculator | MEDICAL | 4.7 | 25 | 1.2 | 1,000 | Paid | $4.99 | Everyone | Medical |
| 2116 | Dr. McDougall Mobile Cookbook | HEALTH_AND_FITNESS | 3.8 | 76 | 1.2 | 1,000 | Paid | $4.99 | Everyone | Health & Fitness |
| 2111 | FB Photographie | PHOTOGRAPHY | 4.7 | 37 | 10.0 | 1,000 | Free | 0 | Everyone | Photography |
| 2137 | Beck & Bo: Toddler First Words | FAMILY | 4.3 | 41 | 7.9 | 1,000 | Paid | $2.99 | Everyone | Education;Pretend Play |
| 2138 | DS-L4 Viewer | PHOTOGRAPHY | 3.5 | 13 | 7.9 | 1,000 | Free | 0 | Everyone | Photography |
| 2142 | CS Customizer | COMMUNICATION | 3.7 | 25 | 3.7 | 1,000 | Free | 0 | Everyone | Communication |
| 2110 | Simple Blood Pressure log | MEDICAL | 3.9 | 31 | 10.0 | 1,000 | Free | 0 | Everyone | Medical |
| 2161 | Hactar Go | FAMILY | 4.8 | 97 | 3.5 | 1,000 | Paid | $2.99 | Everyone | Board;Brain Games |
| 2159 | X Launcher Prime: With OS Style Theme & No Ads | ART_AND_DESIGN | 4.7 | 149 | 3.5 | 1,000 | Paid | $1.99 | Everyone | Art & Design |
| 2158 | DW فارسی By dw-arab.com | NEWS_AND_MAGAZINES | 4.7 | 11 | 4.4 | 1,000 | Free | 0 | Everyone | News & Magazines |
| 2157 | Basket Manager 2016 Pro | SPORTS | 4.5 | 117 | 4.4 | 1,000 | Paid | $0.99 | Everyone | Sports |
| 2156 | Pocket AC | PHOTOGRAPHY | 4.8 | 130 | 4.4 | 1,000 | Paid | $9.99 | Everyone | Photography |
| 2155 | Deaf Interpreter | MEDICAL | 3.8 | 24 | 4.4 | 1,000 | Free | 0 | Everyone | Medical |
| 2154 | Learn Music Notes | FAMILY | 4.7 | 143 | 4.4 | 1,000 | Paid | $1.99 | Everyone | Music;Music & Video |
| 2153 | INTERACTIVE CALCULUS FOR MATHS AND PHYSICS | FAMILY | 4.8 | 53 | 4.4 | 1,000 | Free | 0 | Everyone | Education |
| 2152 | American Girls Mobile Numbers | DATING | 5.0 | 5 | 4.4 | 1,000 | Free | 0 | Mature 17+ | Dating |
| 2151 | CARDI B WALLPAPERS | PERSONALIZATION | 4.1 | 8 | 4.5 | 1,000 | Free | 0 | Everyone | Personalization |
| 2149 | Q downloader : download your social media videos | FAMILY | 4.1 | 41 | 4.5 | 1,000 | Free | 0 | Everyone | Entertainment |
| 2148 | Recycling Cy | LIFESTYLE | 4.0 | 47 | 4.5 | 1,000 | Free | 0 | Everyone | Lifestyle |
| 2147 | dt Pro | FINANCE | 4.8 | 4 | 4.5 | 1,000 | Free | 0 | Everyone | Finance |
| 2146 | FREE VIDEO CHAT - LIVE VIDEO AND TEXT CHAT | DATING | 4.8 | 84 | 4.5 | 1,000 | Free | 0 | Mature 17+ | Dating |
| 2145 | صور حرف H | ART_AND_DESIGN | 4.4 | 13 | 4.5 | 1,000 | Free | 0 | Everyone | Art & Design |
| 2144 | Android P Stock Wallpapers | PERSONALIZATION | 4.5 | 16 | 3.7 | 1,000 | Free | 0 | Everyone | Personalization |
| 2143 | Bitcoin & Cryptocurrency - Bx | FINANCE | 4.8 | 11 | 3.7 | 1,000 | Free | 0 | Everyone | Finance |
| 2140 | AZ Mobile Gizmo | BUSINESS | 4.4 | 16 | 3.7 | 1,000 | Free | 0 | Everyone | Business |
| 2109 | DL Hughley | FAMILY | 4.6 | 12 | 10.0 | 1,000 | Free | 0 | Mature 17+ | Entertainment |
| 2105 | CH Kadels | BUSINESS | 4.4 | 36 | 10.0 | 1,000 | Free | 0 | Everyone | Business |
| 2162 | Flim Af Somali Hindi Fanproj | FAMILY | 3.9 | 12 | 3.5 | 1,000 | Free | 0 | Everyone | Entertainment |
| 2079 | m>notes notepad | PRODUCTIVITY | 4.3 | 184 | 4.0 | 1,000 | Paid | $2.99 | Everyone | Productivity |
| 2077 | AV Tools Pro | TOOLS | 4.3 | 35 | 3.4 | 1,000 | Paid | $2.49 | Everyone | Tools |
| 2076 | BG Metro - Red voznje | TRAVEL_AND_LOCAL | 4.8 | 89 | 12.0 | 1,000 | Free | 0 | Everyone | Travel & Local |
| 2073 | CG Districts | SOCIAL | 3.8 | 14 | 5.3 | 1,000 | Free | 0 | Everyone | Social |
| 2072 | bpresso PRO | MEDICAL | 4.4 | 515 | 11.0 | 1,000 | Paid | $5.49 | Everyone | Medical |
| 2070 | VeinSeek | MEDICAL | 2.5 | 79 | 11.0 | 1,000 | Paid | $3.99 | Everyone | Medical |
To check the data types you can either use .describe() on the column or .info() on the DataFrame.
df_apps_clean.Installs.describe()
count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object
The last line of the output shows that the Installs column has the data type object.
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             8199 non-null   object
 1   Category        8199 non-null   object
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object
 6   Type            8199 non-null   object
 7   Price           8199 non-null   object
 8   Content_Rating  8199 non-null   object
 9   Genres          8199 non-null   object
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB
Here we can see: 5 Installs 8199 non-null object
Both of these show that we are dealing with a non-numeric data type. In this case, the type is "object".
If we take two of the columns, say Installs and the App name, we can count the number of entries per level of installations with .groupby() and .count(). However, because we are dealing with a non-numeric data type, the ordering is not helpful. The reason Python is not recognising our installs as numbers is because of the comma (,) characters.
df_apps_clean[["App", "Installs"]].groupby("Installs").count()
| App | |
|---|---|
| Installs | |
| 1 | 3 |
| 1,000 | 698 |
| 1,000,000 | 1417 |
| 1,000,000,000 | 20 |
| 10 | 69 |
| 10,000 | 988 |
| 10,000,000 | 933 |
| 100 | 303 |
| 100,000 | 1096 |
| 100,000,000 | 189 |
| 5 | 9 |
| 5,000 | 425 |
| 5,000,000 | 607 |
| 50 | 56 |
| 50,000 | 457 |
| 50,000,000 | 202 |
| 500 | 199 |
| 500,000 | 504 |
| 500,000,000 | 24 |
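The odd ordering above is what string sorting produces: values are compared character by character, so "1,000" lands before "10". A minimal sketch on a few made-up values (not the real dataset) shows the difference between text order and numeric order:

```python
# Hypothetical sample values, chosen only to illustrate the ordering.
installs_as_text = ["500", "1,000", "10", "5", "1,000,000"]

# Strings are compared character by character, and ',' sorts before any digit,
# so "1,000" ends up ahead of "10".
text_order = sorted(installs_as_text)

# Removing the commas and converting to integers gives the numeric order we want.
numeric_order = sorted(int(value.replace(",", "")) for value in installs_as_text)

print(text_order)
print(numeric_order)
```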
We can remove the comma (,) character - or any character for that matter - from a column of strings using the .str.replace() method. Here we’re saying: “replace the , with an empty string”. This removes all the commas in the Installs column.
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")
df_apps_clean[["App", "Installs"]].groupby("Installs").count()
| App | |
|---|---|
| Installs | |
| 1 | 3 |
| 10 | 69 |
| 100 | 303 |
| 1000 | 698 |
| 10000 | 988 |
| 100000 | 1096 |
| 1000000 | 1417 |
| 10000000 | 933 |
| 100000000 | 189 |
| 1000000000 | 20 |
| 5 | 9 |
| 50 | 56 |
| 500 | 199 |
| 5000 | 425 |
| 50000 | 457 |
| 500000 | 504 |
| 5000000 | 607 |
| 50000000 | 202 |
| 500000000 | 24 |
With the commas gone, we can convert the column to a numeric type using pd.to_numeric().
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             8199 non-null   object
 1   Category        8199 non-null   object
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64
 6   Type            8199 non-null   object
 7   Price           8199 non-null   object
 8   Content_Rating  8199 non-null   object
 9   Genres          8199 non-null   object
dtypes: float64(2), int64(2), object(6)
memory usage: 704.6+ KB
df_apps_clean[["App", "Installs"]].groupby("Installs").count()
| App | |
|---|---|
| Installs | |
| 1 | 3 |
| 5 | 9 |
| 10 | 69 |
| 50 | 56 |
| 100 | 303 |
| 500 | 199 |
| 1000 | 698 |
| 5000 | 425 |
| 10000 | 988 |
| 50000 | 457 |
| 100000 | 1096 |
| 500000 | 504 |
| 1000000 | 1417 |
| 5000000 | 607 |
| 10000000 | 933 |
| 50000000 | 202 |
| 100000000 | 189 |
| 500000000 | 24 |
| 1000000000 | 20 |
df_apps_clean.head(3)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0 | Everyone | Medical |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | $1.49 | Everyone | Arcade |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | $0.99 | Everyone | Arcade |
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             8199 non-null   object
 1   Category        8199 non-null   object
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64
 6   Type            8199 non-null   object
 7   Price           8199 non-null   object
 8   Content_Rating  8199 non-null   object
 9   Genres          8199 non-null   object
dtypes: float64(2), int64(2), object(6)
memory usage: 704.6+ KB
We can see that the Price column is also stored as text: 7 Price 8199 non-null object
df_apps_clean[["App", "Price"]].groupby("Price").count()
| App | |
|---|---|
| Price | |
| $0.99 | 104 |
| $1.00 | 2 |
| $1.20 | 1 |
| $1.29 | 1 |
| $1.49 | 31 |
| ... | ... |
| $8.49 | 1 |
| $8.99 | 4 |
| $9.00 | 1 |
| $9.99 | 14 |
| 0 | 7595 |
73 rows × 1 columns
We can delete $ from our price and convert it to numeric data type:
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "", regex=False)
df_apps_clean[["App", "Price"]].groupby("Price").count()
| App | |
|---|---|
| Price | |
| 0 | 7595 |
| 0.99 | 104 |
| 1.00 | 2 |
| 1.20 | 1 |
| 1.29 | 1 |
| ... | ... |
| 79.99 | 1 |
| 8.49 | 1 |
| 8.99 | 4 |
| 9.00 | 1 |
| 9.99 | 14 |
73 rows × 1 columns
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             8199 non-null   object
 1   Category        8199 non-null   object
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   int64
 6   Type            8199 non-null   object
 7   Price           8199 non-null   float64
 8   Content_Rating  8199 non-null   object
 9   Genres          8199 non-null   object
dtypes: float64(3), int64(2), object(5)
memory usage: 704.6+ KB
df_apps_clean[["App", "Price"]].groupby("Price").count()
| App | |
|---|---|
| Price | |
| 0.00 | 7595 |
| 0.99 | 104 |
| 1.00 | 2 |
| 1.20 | 1 |
| 1.29 | 1 |
| ... | ... |
| 299.99 | 1 |
| 379.99 | 1 |
| 389.99 | 1 |
| 399.99 | 11 |
| 400.00 | 1 |
73 rows × 1 columns
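The Installs and Price columns both needed the same two steps: strip a character, then convert to a number. That pattern can be wrapped in a small helper. The sketch below runs on made-up rows; to_number is a hypothetical name introduced here for illustration, not a pandas function and not part of the original notebook.

```python
import pandas as pd


def to_number(column: pd.Series, unwanted: str) -> pd.Series:
    """Strip every character in `unwanted` from a text column, then convert it to numbers."""
    cleaned = column.astype(str)
    for char in unwanted:
        # regex=False treats the character literally, which matters for '$'.
        cleaned = cleaned.str.replace(char, "", regex=False)
    return pd.to_numeric(cleaned)


# Made-up sample rows in the same shape as the Installs and Price columns.
sample = pd.DataFrame({"Installs": ["1,000", "10", "5,000,000"],
                       "Price": ["$4.99", "0", "$0.99"]})
sample["Installs"] = to_number(sample["Installs"], ",")
sample["Price"] = to_number(sample["Price"], "$")
print(sample.dtypes)
```

Passing regex=False explicitly keeps the behaviour the same across pandas versions, since '$' is otherwise a special character in regular expressions.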
df_apps_clean.sort_values('Price', ascending=False).head(20)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3946 | I'm Rich - Trump Edition | LIFESTYLE | 3.6 | 275 | 7.300000 | 10000 | Paid | 400.00 | Everyone | Lifestyle |
| 2461 | I AM RICH PRO PLUS | FINANCE | 4.0 | 36 | 41.000000 | 1000 | Paid | 399.99 | Everyone | Finance |
| 4606 | I Am Rich Premium | FINANCE | 4.1 | 1867 | 4.700000 | 50000 | Paid | 399.99 | Everyone | Finance |
| 3145 | I am rich(premium) | FINANCE | 3.5 | 472 | 0.942383 | 5000 | Paid | 399.99 | Everyone | Finance |
| 3554 | 💎 I'm rich | LIFESTYLE | 3.8 | 718 | 26.000000 | 10000 | Paid | 399.99 | Everyone | Lifestyle |
| 5765 | I am rich | LIFESTYLE | 3.8 | 3547 | 1.800000 | 100000 | Paid | 399.99 | Everyone | Lifestyle |
| 1946 | I am rich (Most expensive app) | FINANCE | 4.1 | 129 | 2.700000 | 1000 | Paid | 399.99 | Teen | Finance |
| 2775 | I Am Rich Pro | FAMILY | 4.4 | 201 | 2.700000 | 5000 | Paid | 399.99 | Everyone | Entertainment |
| 3221 | I am Rich Plus | FAMILY | 4.0 | 856 | 8.700000 | 10000 | Paid | 399.99 | Everyone | Entertainment |
| 3114 | I am Rich | FINANCE | 4.3 | 180 | 3.800000 | 5000 | Paid | 399.99 | Everyone | Finance |
| 1331 | most expensive app (H) | FAMILY | 4.3 | 6 | 1.500000 | 100 | Paid | 399.99 | Everyone | Entertainment |
| 2394 | I am Rich! | FINANCE | 3.8 | 93 | 22.000000 | 1000 | Paid | 399.99 | Everyone | Finance |
| 3897 | I Am Rich | FAMILY | 3.6 | 217 | 4.900000 | 10000 | Paid | 389.99 | Everyone | Entertainment |
| 2193 | I am extremely Rich | LIFESTYLE | 2.9 | 41 | 2.900000 | 1000 | Paid | 379.99 | Everyone | Lifestyle |
| 3856 | I am rich VIP | LIFESTYLE | 3.8 | 411 | 2.600000 | 10000 | Paid | 299.99 | Everyone | Lifestyle |
| 2281 | Vargo Anesthesia Mega App | MEDICAL | 4.6 | 92 | 32.000000 | 1000 | Paid | 79.99 | Everyone | Medical |
| 1407 | LTC AS Legal | MEDICAL | 4.0 | 6 | 1.300000 | 100 | Paid | 39.99 | Everyone | Medical |
| 2629 | I am Rich Person | LIFESTYLE | 4.2 | 134 | 1.800000 | 1000 | Paid | 37.99 | Everyone | Lifestyle |
| 2481 | A Manual of Acupuncture | MEDICAL | 3.5 | 214 | 68.000000 | 1000 | Paid | 33.99 | Everyone | Medical |
| 4264 | Golfshot Plus: Golf GPS | SPORTS | 4.1 | 3387 | 25.000000 | 50000 | Paid | 29.99 | Everyone | Sports |
Here we can see that the 15 most expensive apps are all variations of the same "I am Rich" app!
Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest-grossing paid apps according to this estimate? Out of the top 10, how many are games?
What’s going on here? There are apparently 15 I am Rich apps in the Google Play Store. They all cost around $300 or more, which is the main point of the app. The story goes that in 2008, Armin Heinrich released the very first I am Rich app in the iOS App Store for $999.99. The app does absolutely nothing. It just displays the picture of a gemstone and can be used to prove to your friends how rich you are. Armin actually made a total of 7 sales before the app was hastily removed by Apple. Nonetheless, it inspired a bunch of copycats on the Google Play Store, but if you search today, you’ll find all of these apps have disappeared as well. The high installation numbers are likely gamed by making the app available for free at some point to get reviews and appear more legitimate.
Leaving this bad data in our dataset will misrepresent our analysis of the most expensive 'real' apps. Here’s how we can remove these rows:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(5)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2281 | Vargo Anesthesia Mega App | MEDICAL | 4.6 | 92 | 32.0 | 1000 | Paid | 79.99 | Everyone | Medical |
| 1407 | LTC AS Legal | MEDICAL | 4.0 | 6 | 1.3 | 100 | Paid | 39.99 | Everyone | Medical |
| 2629 | I am Rich Person | LIFESTYLE | 4.2 | 134 | 1.8 | 1000 | Paid | 37.99 | Everyone | Lifestyle |
| 2481 | A Manual of Acupuncture | MEDICAL | 3.5 | 214 | 68.0 | 1000 | Paid | 33.99 | Everyone | Medical |
| 2463 | PTA Content Master | MEDICAL | 4.2 | 64 | 41.0 | 1000 | Paid | 29.99 | Everyone | Medical |
When we look at the top 5 apps now, we see that 4 out of 5 are medical apps.
We can work out the highest grossing paid apps now. All we need to do is multiply the values in the price and the installs column to get the number:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9220 | Minecraft | FAMILY | 4.5 | 2376564 | 19.000000 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 69900000.0 |
| 8825 | Hitman Sniper | GAME | 4.6 | 408292 | 29.000000 | 10000000 | Paid | 0.99 | Mature 17+ | Action | 9900000.0 |
| 7151 | Grand Theft Auto: San Andreas | GAME | 4.4 | 348962 | 26.000000 | 1000000 | Paid | 6.99 | Mature 17+ | Action | 6990000.0 |
| 7477 | Facetune - For Free | PHOTOGRAPHY | 4.4 | 49553 | 48.000000 | 1000000 | Paid | 5.99 | Everyone | Photography | 5990000.0 |
| 7977 | Sleep as Android Unlock | LIFESTYLE | 4.5 | 23966 | 0.851562 | 1000000 | Paid | 5.99 | Everyone | Lifestyle | 5990000.0 |
| 6594 | DraStic DS Emulator | GAME | 4.6 | 87766 | 12.000000 | 1000000 | Paid | 4.99 | Everyone | Action | 4990000.0 |
| 6082 | Weather Live | WEATHER | 4.5 | 76593 | 4.750000 | 500000 | Paid | 5.99 | Everyone | Weather | 2995000.0 |
| 7954 | Bloons TD 5 | FAMILY | 4.6 | 190086 | 94.000000 | 1000000 | Paid | 2.99 | Everyone | Strategy | 2990000.0 |
| 7633 | Five Nights at Freddy's | GAME | 4.6 | 100805 | 50.000000 | 1000000 | Paid | 2.99 | Teen | Action | 2990000.0 |
| 6746 | Card Wars - Adventure Time | FAMILY | 4.3 | 129603 | 23.000000 | 1000000 | Paid | 2.99 | Everyone 10+ | Card;Action & Adventure | 2990000.0 |
The top spot of the highest-grossing paid app goes to … Minecraft at close to $70 million. It’s quite interesting that Minecraft (along with Bloons and Card Wars) is actually listed in the Family category rather than in the Game category. If we include these titles, we see that 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels.
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8184 entries, 21 to 10835
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   App               8184 non-null   object
 1   Category          8184 non-null   object
 2   Rating            8184 non-null   float64
 3   Reviews           8184 non-null   int64
 4   Size_MBs          8184 non-null   float64
 5   Installs          8184 non-null   int64
 6   Type              8184 non-null   object
 7   Price             8184 non-null   float64
 8   Content_Rating    8184 non-null   object
 9   Genres            8184 non-null   object
 10  Revenue_Estimate  8184 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 767.2+ KB
df_apps_clean.Revenue_Estimate = df_apps_clean.Revenue_Estimate.astype(int)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9220 | Minecraft | FAMILY | 4.5 | 2376564 | 19.000000 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 69900000 |
| 8825 | Hitman Sniper | GAME | 4.6 | 408292 | 29.000000 | 10000000 | Paid | 0.99 | Mature 17+ | Action | 9900000 |
| 7151 | Grand Theft Auto: San Andreas | GAME | 4.4 | 348962 | 26.000000 | 1000000 | Paid | 6.99 | Mature 17+ | Action | 6990000 |
| 7977 | Sleep as Android Unlock | LIFESTYLE | 4.5 | 23966 | 0.851562 | 1000000 | Paid | 5.99 | Everyone | Lifestyle | 5990000 |
| 7477 | Facetune - For Free | PHOTOGRAPHY | 4.4 | 49553 | 48.000000 | 1000000 | Paid | 5.99 | Everyone | Photography | 5990000 |
| 6594 | DraStic DS Emulator | GAME | 4.6 | 87766 | 12.000000 | 1000000 | Paid | 4.99 | Everyone | Action | 4990000 |
| 6082 | Weather Live | WEATHER | 4.5 | 76593 | 4.750000 | 500000 | Paid | 5.99 | Everyone | Weather | 2995000 |
| 7044 | Tasker | TOOLS | 4.6 | 43045 | 3.400000 | 1000000 | Paid | 2.99 | Everyone | Tools | 2990000 |
| 7954 | Bloons TD 5 | FAMILY | 4.6 | 190086 | 94.000000 | 1000000 | Paid | 2.99 | Everyone | Strategy | 2990000 |
| 7633 | Five Nights at Freddy's | GAME | 4.6 | 100805 | 50.000000 | 1000000 | Paid | 2.99 | Teen | Action | 2990000 |
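One thing to keep in mind with this conversion: .astype(int) drops the decimal part rather than rounding, so a small estimate such as 0.99 becomes 0. If that matters for your analysis, one option is to round first. A sketch on made-up values:

```python
import pandas as pd

# Made-up revenue estimates (price x installs), two tiny ones and one large one.
revenue = pd.Series([1.49, 0.99, 69900000.0])

truncated = revenue.astype(int)        # drops the decimals: 1.49 -> 1, 0.99 -> 0
rounded = revenue.round().astype(int)  # rounds to the nearest whole number first

print(truncated.tolist())
print(rounded.tolist())
```

For the large estimates at the top of the table the two approaches give the same result; the difference only shows up for apps with very few installs. Writing .astype("int64") instead of .astype(int) should also make the integer width explicit, since plain int can map to int32 on some platforms.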
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8184 entries, 21 to 10835
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   App               8184 non-null   object
 1   Category          8184 non-null   object
 2   Rating            8184 non-null   float64
 3   Reviews           8184 non-null   int64
 4   Size_MBs          8184 non-null   float64
 5   Installs          8184 non-null   int64
 6   Type              8184 non-null   object
 7   Price             8184 non-null   float64
 8   Content_Rating    8184 non-null   object
 9   Genres            8184 non-null   object
 10  Revenue_Estimate  8184 non-null   int32
dtypes: float64(3), int32(1), int64(2), object(5)
memory usage: 735.3+ KB
If you were to release an app, would you choose to go after a competitive category with many other apps? Or would you target a popular category with a high number of downloads? Or perhaps you can target a category which is popular but where the downloads are also spread out among many different apps. That way, even if it’s more difficult to discover among all the other apps, your app has a better chance of getting installed, right? Let’s analyse this with bar charts and scatter plots and figure out which categories are dominating the market.
df_apps_clean.head(3)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0.00 | Everyone | Medical | August 2, 2018 | 4.0.3 and up | 0 |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | 1.49 | Everyone | Arcade | February 8, 2017 | 2.3 and up | 1 |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | 0.99 | Everyone | Arcade | March 3, 2017 | 2.3 and up | 0 |
We can find the number of different categories like so:
df_apps_clean.Category.unique()
array(['MEDICAL', 'GAME', 'SPORTS', 'BUSINESS', 'BOOKS_AND_REFERENCE',
'SOCIAL', 'TOOLS', 'FAMILY', 'COMMUNICATION', 'PRODUCTIVITY',
'LIFESTYLE', 'DATING', 'EVENTS', 'MAPS_AND_NAVIGATION', 'SHOPPING',
'PERSONALIZATION', 'PARENTING', 'PHOTOGRAPHY',
'HEALTH_AND_FITNESS', 'FOOD_AND_DRINK', 'NEWS_AND_MAGAZINES',
'FINANCE', 'TRAVEL_AND_LOCAL', 'AUTO_AND_VEHICLES',
'ART_AND_DESIGN', 'BEAUTY', 'VIDEO_PLAYERS', 'COMICS', 'WEATHER',
'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'EDUCATION',
'ENTERTAINMENT'], dtype=object)
This shows us that there are 33 unique categories.
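If only the count is needed, pandas also has .nunique(), which returns the number of distinct values directly instead of the array of labels. A small sketch on made-up labels:

```python
import pandas as pd

# Made-up category labels, just to show the two calls side by side.
categories = pd.Series(["GAME", "FAMILY", "GAME", "TOOLS", "FAMILY"])

print(categories.unique())   # the distinct labels themselves
print(categories.nunique())  # how many distinct labels there are
```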
To calculate the number of apps per category we can use our old friend .value_counts():
df_apps_clean.Category.value_counts()
FAMILY                 1606
GAME                    910
TOOLS                   719
PRODUCTIVITY            301
PERSONALIZATION         298
LIFESTYLE               297
FINANCE                 296
MEDICAL                 292
PHOTOGRAPHY             263
BUSINESS                262
SPORTS                  260
COMMUNICATION           257
HEALTH_AND_FITNESS      243
NEWS_AND_MAGAZINES      204
SOCIAL                  203
TRAVEL_AND_LOCAL        187
SHOPPING                180
BOOKS_AND_REFERENCE     169
VIDEO_PLAYERS           148
DATING                  134
MAPS_AND_NAVIGATION     118
EDUCATION               118
ENTERTAINMENT           102
FOOD_AND_DRINK           94
AUTO_AND_VEHICLES        73
WEATHER                  72
LIBRARIES_AND_DEMO       64
HOUSE_AND_HOME           62
ART_AND_DESIGN           61
COMICS                   54
PARENTING                50
EVENTS                   45
BEAUTY                   42
Name: Category, dtype: int64
Or we can look at just the 10 largest categories:
top10_category = df_apps_clean.Category.value_counts()[:10]
print(top10_category)
FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64
bar = px.bar(x = top10_category.index, # index = category name
y = top10_category.values)
bar.show()
Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.
But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs
| Installs | |
|---|---|
| Category | |
| ART_AND_DESIGN | 114233100 |
| AUTO_AND_VEHICLES | 53129800 |
| BEAUTY | 26916200 |
| BOOKS_AND_REFERENCE | 1665791655 |
| BUSINESS | 692018120 |
| COMICS | 44931100 |
| COMMUNICATION | 11039241530 |
| DATING | 140912410 |
| EDUCATION | 352852000 |
| ENTERTAINMENT | 2113660000 |
| EVENTS | 15949410 |
| FAMILY | 4437554490 |
| FINANCE | 455249400 |
| FOOD_AND_DRINK | 211677750 |
| GAME | 13858762717 |
| HEALTH_AND_FITNESS | 1134006220 |
| HOUSE_AND_HOME | 97082000 |
| LIBRARIES_AND_DEMO | 52083000 |
| LIFESTYLE | 503611120 |
| MAPS_AND_NAVIGATION | 503267560 |
| MEDICAL | 39162676 |
| NEWS_AND_MAGAZINES | 2369110650 |
| PARENTING | 31116110 |
| PERSONALIZATION | 1532352930 |
| PHOTOGRAPHY | 4649143130 |
| PRODUCTIVITY | 5788070180 |
| SHOPPING | 1400331540 |
| SOCIAL | 5487841475 |
| SPORTS | 1096431465 |
| TOOLS | 8099724500 |
| TRAVEL_AND_LOCAL | 2894859300 |
| VIDEO_PLAYERS | 3916897200 |
| WEATHER | 361096500 |
category_installs.sort_values('Installs', ascending=True, inplace=True)
category_installs.head(10)
| Installs | |
|---|---|
| Category | |
| EVENTS | 15949410 |
| BEAUTY | 26916200 |
| PARENTING | 31116110 |
| MEDICAL | 39162676 |
| COMICS | 44931100 |
| LIBRARIES_AND_DEMO | 52083000 |
| AUTO_AND_VEHICLES | 53129800 |
| HOUSE_AND_HOME | 97082000 |
| ART_AND_DESIGN | 114233100 |
| DATING | 140912410 |
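The group, sum and sort steps above can be tried out on a tiny table first. This is a sketch on made-up rows standing in for df_apps_clean; it passes the aggregation as the string "sum", which should be equivalent to pd.Series.sum here.

```python
import pandas as pd

# Made-up rows standing in for df_apps_clean.
apps = pd.DataFrame({"Category": ["GAME", "GAME", "TOOLS", "MEDICAL"],
                     "Installs": [1000, 500, 100, 10]})

# Total installs per category; the category names become the index.
installs_by_category = apps.groupby("Category").agg({"Installs": "sum"})

# sort_values returns a new, sorted DataFrame when inplace is left at its default.
installs_by_category = installs_by_category.sort_values("Installs")
print(installs_by_category)
```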
Then we can create a horizontal bar chart, simply by adding the orientation parameter:
h_bar = px.bar(x = category_installs.Installs,
y = category_installs.index,
orientation='h')
h_bar.show()
We can also add a custom title and axis labels like so:
h_bar = px.bar(x = category_installs.Installs,
y = category_installs.index,
orientation='h',
title='Category Popularity')
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()
Now we see that Games and Communication are actually the most popular categories, with Tools in third place. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do a few apps have most of the downloads, or are the downloads spread out over many apps?
Let's create a DataFrame that has the number of apps in one column and the number of installs in another:
number_of_apps = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})
number_of_apps
| App | |
|---|---|
| Category | |
| ART_AND_DESIGN | 61 |
| AUTO_AND_VEHICLES | 73 |
| BEAUTY | 42 |
| BOOKS_AND_REFERENCE | 169 |
| BUSINESS | 262 |
| COMICS | 54 |
| COMMUNICATION | 257 |
| DATING | 134 |
| EDUCATION | 118 |
| ENTERTAINMENT | 102 |
| EVENTS | 45 |
| FAMILY | 1606 |
| FINANCE | 296 |
| FOOD_AND_DRINK | 94 |
| GAME | 910 |
| HEALTH_AND_FITNESS | 243 |
| HOUSE_AND_HOME | 62 |
| LIBRARIES_AND_DEMO | 64 |
| LIFESTYLE | 297 |
| MAPS_AND_NAVIGATION | 118 |
| MEDICAL | 292 |
| NEWS_AND_MAGAZINES | 204 |
| PARENTING | 50 |
| PERSONALIZATION | 298 |
| PHOTOGRAPHY | 263 |
| PRODUCTIVITY | 301 |
| SHOPPING | 180 |
| SOCIAL | 203 |
| SPORTS | 260 |
| TOOLS | 719 |
| TRAVEL_AND_LOCAL | 187 |
| VIDEO_PLAYERS | 148 |
| WEATHER | 72 |
Then we can use .merge() and combine the two DataFrames:
cat_merged_df = pd.merge(number_of_apps, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)
The dimensions of the DataFrame are: (33, 2)
| App | Installs | |
|---|---|---|
| Category | ||
| GAME | 910 | 13858762717 |
| COMMUNICATION | 257 | 11039241530 |
| TOOLS | 719 | 8099724500 |
| PRODUCTIVITY | 301 | 5788070180 |
| SOCIAL | 203 | 5487841475 |
| PHOTOGRAPHY | 263 | 4649143130 |
| FAMILY | 1606 | 4437554490 |
| VIDEO_PLAYERS | 148 | 3916897200 |
| TRAVEL_AND_LOCAL | 187 | 2894859300 |
| NEWS_AND_MAGAZINES | 204 | 2369110650 |
| ENTERTAINMENT | 102 | 2113660000 |
| BOOKS_AND_REFERENCE | 169 | 1665791655 |
| PERSONALIZATION | 298 | 1532352930 |
| SHOPPING | 180 | 1400331540 |
| HEALTH_AND_FITNESS | 243 | 1134006220 |
| SPORTS | 260 | 1096431465 |
| BUSINESS | 262 | 692018120 |
| LIFESTYLE | 297 | 503611120 |
| MAPS_AND_NAVIGATION | 118 | 503267560 |
| FINANCE | 296 | 455249400 |
| WEATHER | 72 | 361096500 |
| EDUCATION | 118 | 352852000 |
| FOOD_AND_DRINK | 94 | 211677750 |
| DATING | 134 | 140912410 |
| ART_AND_DESIGN | 61 | 114233100 |
| HOUSE_AND_HOME | 62 | 97082000 |
| AUTO_AND_VEHICLES | 73 | 53129800 |
| LIBRARIES_AND_DEMO | 64 | 52083000 |
| COMICS | 54 | 44931100 |
| MEDICAL | 292 | 39162676 |
| PARENTING | 50 | 31116110 |
| BEAUTY | 42 | 26916200 |
| EVENTS | 45 | 15949410 |
Now we can create the chart. Note that we can pass in an entire DataFrame and specify which columns should be used for the x and y by column name.
scatter = px.scatter(cat_merged_df, # data
x='App', # column name
y='Installs',
title='Category Concentration',
size='App',
hover_name=cat_merged_df.index,
color='Installs')
scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
yaxis_title="Installs",
yaxis=dict(type='log'))
scatter.show()
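To put a rough number on what the scatter plot shows, we could also divide the installs by the number of apps in each category. This is only an assumed proxy for concentration (an average hides how unevenly installs are distributed within a category), and the Installs_Per_App column name is made up for this sketch. The figures are copied from the merged table above.

```python
import pandas as pd

# A few rows copied from the merged table above.
cat_sample = pd.DataFrame(
    {"App": [910, 257, 1606, 292],
     "Installs": [13858762717, 11039241530, 4437554490, 39162676]},
    index=["GAME", "COMMUNICATION", "FAMILY", "MEDICAL"],
)

# Average installs per app: a rough, assumed proxy for how concentrated a category is.
cat_sample["Installs_Per_App"] = cat_sample["Installs"] / cat_sample["App"]
print(cat_sample.sort_values("Installs_Per_App", ascending=False))
```

On these four rows, Communication comes out ahead of Game: far fewer apps share a similar number of installs.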
Let’s turn our attention to the Genres column. This is quite similar to the categories column but more granular.
How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.
If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here. We can see this using .value_counts() and looking at the values that just have a single entry. There we see that the semi-colon (;) separates the genre names.
# Number of genres?
len(df_apps_clean.Genres.unique())
114
# Problem: has multiple categories separated by ; like Travel & Local;Action & Adventure 1
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]
Strategy;Creativity          1
Tools;Education              1
Art & Design;Pretend Play    1
Role Playing;Brain Games     1
Strategy;Education           1
Name: Genres, dtype: int64
We somehow need to separate the genre names to get a clear picture. This is where the string’s .split() method comes in handy. After we’ve separated our genre names based on the semi-colon, we can add them all into a single column with .stack() and then use .value_counts().
# Split the strings on the semi-colon and then .stack them.
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')
We now have a single column with shape: (8564,)
Number of genres: 53
num_genres.sort_values(ascending=True)[:5]
Music & Audio     1
Music            21
Word             22
Trivia           28
Music & Video    31
dtype: int64
stack.value_counts().sort_values(ascending=True)[:5]
Music & Audio     1
Music            21
Word             22
Trivia           28
Music & Video    31
dtype: int64
This shows us we actually have 53 different genres.
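The split-and-stack approach can be checked on a handful of made-up Genres values. The sketch also shows .explode() as an alternative that should give the same counts; it is not what the notebook above uses.

```python
import pandas as pd

# Made-up Genres values, including nested entries separated by ';'.
genres = pd.Series(["Arcade", "Education;Pretend Play",
                    "Arcade;Action & Adventure", "Education"])

# Approach from the text: split into columns, then stack them into one long Series.
# Empty slots from the split are ignored by value_counts().
stacked = genres.str.split(";", expand=True).stack()

# Alternative: split into lists, then give each list element its own row.
exploded = genres.str.split(";").explode()

print(stacked.value_counts())
print(exploded.value_counts())
```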
Let's create a bar chart with the Series containing the genre data.
bar_genre = px.bar(x = num_genres.index[:15], # index = category name
y = num_genres.values[:15])
bar_genre.show()
Try experimenting with the built-in colour scales in Plotly. You can find a full list here: https://plotly.com/python/builtin-colorscales/
Find a way to set the colour scale using the color_continuous_scale parameter.
Find a way to make the colour axis disappear by using coloraxis_showscale.
num_genres.sort_values(ascending=False)[:15]
Tools               719
Education           587
Entertainment       498
Action              304
Productivity        301
Lifestyle           298
Personalization     298
Finance             296
Medical             292
Sports              270
Photography         263
Business            262
Communication       258
Health & Fitness    245
Casual              216
dtype: int64
bar_genre = px.bar(x = num_genres.index[:15], # index = category name
y = num_genres.values[:15], # count
title='Top Genres',
hover_name=num_genres.index[:15],
color=num_genres.values[:15],
color_continuous_scale='Agsunset')
bar_genre.update_layout(xaxis_title='Genre',
yaxis_title='Number of Apps',
coloraxis_showscale=False)
bar_genre.show()
Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.
df_apps_clean.Type.value_counts()
Free    7595
Paid     589
Name: Type, dtype: int64
We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate. We can group our data first by Category and then by Type. Then we can count the number of apps of each type. Using as_index=False we push all the data into columns rather than end up with our Categories as the index.
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"]).agg({'App': pd.Series.count})
df_free_vs_paid.head()
| App | ||
|---|---|---|
| Category | Type | |
| ART_AND_DESIGN | Free | 58 |
| Paid | 3 | |
| AUTO_AND_VEHICLES | Free | 72 |
| Paid | 1 | |
| BEAUTY | Free | 42 |
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.head(10)
| Category | Type | App | |
|---|---|---|---|
| 0 | ART_AND_DESIGN | Free | 58 |
| 1 | ART_AND_DESIGN | Paid | 3 |
| 2 | AUTO_AND_VEHICLES | Free | 72 |
| 3 | AUTO_AND_VEHICLES | Paid | 1 |
| 4 | BEAUTY | Free | 42 |
| 5 | BOOKS_AND_REFERENCE | Free | 161 |
| 6 | BOOKS_AND_REFERENCE | Paid | 8 |
| 7 | BUSINESS | Free | 253 |
| 8 | BUSINESS | Paid | 9 |
| 9 | COMICS | Free | 54 |
df_free_vs_paid
| Category | Type | App | |
|---|---|---|---|
| 0 | ART_AND_DESIGN | Free | 58 |
| 1 | ART_AND_DESIGN | Paid | 3 |
| 2 | AUTO_AND_VEHICLES | Free | 72 |
| 3 | AUTO_AND_VEHICLES | Paid | 1 |
| 4 | BEAUTY | Free | 42 |
| ... | ... | ... | ... |
| 56 | TRAVEL_AND_LOCAL | Paid | 8 |
| 57 | VIDEO_PLAYERS | Free | 144 |
| 58 | VIDEO_PLAYERS | Paid | 4 |
| 59 | WEATHER | Free | 65 |
| 60 | WEATHER | Paid | 7 |
61 rows × 3 columns
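The effect of as_index=False is easier to see on a tiny table. In this sketch on made-up rows, the first result keeps Category and Type as a two-level index, while the second keeps them as ordinary columns, which is the shape Plotly Express expects when columns are referenced by name.

```python
import pandas as pd

# Made-up rows standing in for df_apps_clean.
apps = pd.DataFrame({"App": ["a", "b", "c", "d"],
                     "Category": ["GAME", "GAME", "GAME", "TOOLS"],
                     "Type": ["Free", "Paid", "Free", "Free"]})

# By default the grouping keys become a two-level (Category, Type) index.
indexed = apps.groupby(["Category", "Type"]).agg({"App": "count"})

# With as_index=False the keys stay as regular columns next to the count.
flat = apps.groupby(["Category", "Type"], as_index=False).agg({"App": "count"})

print(indexed)
print(flat)
```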
df_paid_apps = df_free_vs_paid[df_free_vs_paid.Type == "Paid"]
df_paid_apps
| Category | Type | App | |
|---|---|---|---|
| 1 | ART_AND_DESIGN | Paid | 3 |
| 3 | AUTO_AND_VEHICLES | Paid | 1 |
| 6 | BOOKS_AND_REFERENCE | Paid | 8 |
| 8 | BUSINESS | Paid | 9 |
| 11 | COMMUNICATION | Paid | 22 |
| 13 | DATING | Paid | 3 |
| 15 | EDUCATION | Paid | 4 |
| 17 | ENTERTAINMENT | Paid | 2 |
| 20 | FAMILY | Paid | 150 |
| 22 | FINANCE | Paid | 7 |
| 24 | FOOD_AND_DRINK | Paid | 2 |
| 26 | GAME | Paid | 76 |
| 28 | HEALTH_AND_FITNESS | Paid | 11 |
| 32 | LIFESTYLE | Paid | 13 |
| 34 | MAPS_AND_NAVIGATION | Paid | 5 |
| 36 | MEDICAL | Paid | 63 |
| 38 | NEWS_AND_MAGAZINES | Paid | 2 |
| 40 | PARENTING | Paid | 2 |
| 42 | PERSONALIZATION | Paid | 65 |
| 44 | PHOTOGRAPHY | Paid | 15 |
| 46 | PRODUCTIVITY | Paid | 18 |
| 48 | SHOPPING | Paid | 2 |
| 50 | SOCIAL | Paid | 2 |
| 52 | SPORTS | Paid | 22 |
| 54 | TOOLS | Paid | 63 |
| 56 | TRAVEL_AND_LOCAL | Paid | 8 |
| 58 | VIDEO_PLAYERS | Paid | 4 |
| 60 | WEATHER | Paid | 7 |
Unsurprisingly, the biggest categories have the most paid apps. However, there might be some patterns if we put the numbers on a graph!
We can use the plotly express bar chart examples https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories
and the .bar() API reference https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.barto create a bar chart:
The key is using the color and barmode parameters for the .bar() method. To get a particular order, you can pass a dictionary to the axis parameter in .update_layout().
bar_free_vs_paid = px.bar(df_free_vs_paid,
                          x='Category',
                          y='App',
                          title='Free vs Paid Apps by Category',
                          color='Type',
                          barmode='group')
bar_free_vs_paid.update_layout(xaxis_title='Category',
                               yaxis_title='Number of Apps',
                               xaxis={'categoryorder':'total descending'},
                               yaxis=dict(type='log'))
bar_free_vs_paid.show()
What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.
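To put a number on "relatively more paid apps", we can compute the share of paid apps within each category. The following is a minimal sketch on a small subset of the counts shown in the tables above; the names counts, paid_share and Paid_Share are made up for this illustration and are not part of the original notebook.

```python
import pandas as pd

# Small subset of df_free_vs_paid, using counts from the tables above.
df_free_vs_paid = pd.DataFrame({
    "Category": ["ART_AND_DESIGN", "ART_AND_DESIGN", "BEAUTY",
                 "BUSINESS", "BUSINESS", "WEATHER", "WEATHER"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid", "Free", "Paid"],
    "App": [58, 3, 42, 253, 9, 65, 7],
})

# One row per category and one column per Type.
# Categories with no paid apps (e.g. BEAUTY) get a 0 instead of NaN.
counts = df_free_vs_paid.pivot_table(index="Category", columns="Type",
                                     values="App", aggfunc="sum",
                                     fill_value=0)

# Illustrative column: fraction of a category's apps that are paid.
counts["Paid_Share"] = counts["Paid"] / (counts["Free"] + counts["Paid"])
paid_share = counts.sort_values("Paid_Share", ascending=False)
print(paid_share)
```

In this subset Weather comes out on top even though Business has more paid apps in absolute terms, which is the kind of pattern the grouped bar chart hints at.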
But this leads to many more questions:
How much should you charge? What are other apps charging in that category?
How much revenue could you make?
And how many downloads are you potentially giving up because your app is paid?
Box plots show us some handy descriptive statistics in a graph: the median value, the maximum value, the minimum value, and the quartiles.
Let's create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?
Use the Box Plots Guide https://plotly.com/python/box-plots/ and the .box API reference https://plotly.com/python-api-reference/generated/plotly.express.box.html to create the chart.
df_apps_clean.head(3)
| | App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.0 | 4 | 25.0 | 1 | Free | 0.00 | Everyone | Medical | 0 |
| 28 | Ra Ga Ba | GAME | 5.0 | 2 | 20.0 | 1 | Paid | 1.49 | Everyone | Arcade | 1 |
| 47 | Mu.F.O. | GAME | 5.0 | 2 | 16.0 | 1 | Paid | 0.99 | Everyone | Arcade | 0 |
box = px.box(df_apps_clean,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?')
box.update_layout(yaxis=dict(type='log'))
box.show()
From the hover text in the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000! That is roughly a hundred times lower.
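Instead of reading the medians off the hover text, we could also compute them directly with a groupby. This is a sketch on hypothetical toy data (the install counts below are invented so that the medians match the figures quoted above); it assumes the Installs column has already been converted to a numeric type.

```python
import pandas as pd

# Hypothetical toy data, NOT the real df_apps_clean.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [100_000, 500_000, 1_000_000, 1_000, 5_000, 10_000],
})

# Median installs per app type: the same statistic the box plot
# draws as the line inside each box.
medians = toy.groupby("Type")["Installs"].median()
ratio = medians["Free"] / medians["Paid"]

print(medians)
print(f"Free apps have {ratio:.0f}x the median installs of paid apps")
```

On the real data the same two lines would be applied to df_apps_clean instead of toy.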
But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.
If an Android app costs $30,000 to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about $20,000. Many apps earned even less, meaning they would need other sources of revenue such as advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers with much higher (estimated) revenue, for example Medical, Personalization, Tools, Game, and Family.
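The comparison against a development cost can be sketched as a groupby on the median revenue per category. The rows below are hypothetical and only chosen so that the photography median matches the figure quoted above; the sketch also assumes that Revenue_Estimate is approximated as installs times price, and DEV_COST, median_revenue and covers_cost are names made up for this illustration.

```python
import pandas as pd

DEV_COST = 30_000  # development cost figure used in the text

# Hypothetical toy rows, NOT taken from the real dataset.
toy_paid = pd.DataFrame({
    "Category": ["PHOTOGRAPHY", "PHOTOGRAPHY", "PHOTOGRAPHY",
                 "MEDICAL", "MEDICAL", "MEDICAL"],
    "Price": [1.99, 3.99, 2.00, 5.49, 9.99, 2.99],
    "Installs": [1_000, 10_000, 10_000, 100, 10_000, 50_000],
})

# Assumption: ballpark revenue = number of installs * price.
toy_paid["Revenue_Estimate"] = toy_paid["Installs"] * toy_paid["Price"]

# Median revenue per category, then a flag for whether that median
# would cover the assumed development cost.
median_revenue = toy_paid.groupby("Category")["Revenue_Estimate"].median()
covers_cost = median_revenue >= DEV_COST

print(median_revenue)
print(covers_cost)
```

Using the median rather than the mean keeps a handful of very high-revenue outliers from distorting the picture for a typical app.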
So, if you were to list a paid app, how should you price it? To help you decide we can look at how your competitors in the same category price their apps.
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']
box = px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))
box.show()
What is the median price for a paid app? Let's compare pricing by category by creating another box plot. This time we examine the prices (instead of the revenue estimates) of the paid apps. We can use {'categoryorder':'max descending'} to sort the categories.
df_paid_apps.Price.median()
2.99
The median price for a paid Android app is $2.99.
However, some categories have higher median prices than others. This time we see that the Medical category has the most expensive apps as well as a median price of $5.49. In contrast, Personalization apps are quite cheap, with a median price of $1.49. Other categories with higher median prices are Business ($4.99) and Dating ($6.99). It seems like customers who shop in these categories are not so concerned about paying a bit extra for their apps.
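The per-category medians can also be computed as a table rather than read off the chart. A minimal sketch, using hypothetical prices chosen so that the medians match the Medical and Personalization figures quoted above (median_price is a name made up for this illustration):

```python
import pandas as pd

# Hypothetical toy prices, NOT the real df_paid_apps.
toy_paid = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "MEDICAL",
                 "PERSONALIZATION", "PERSONALIZATION", "PERSONALIZATION"],
    "Price": [2.99, 5.49, 24.99, 0.99, 1.49, 1.99],
})

# Median price per category, most expensive category first.
median_price = (toy_paid.groupby("Category")["Price"]
                .median()
                .sort_values(ascending=False))
print(median_price)
```

On the real data, the same groupby on df_paid_apps would give the numbers behind each box in the chart.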
box = px.box(df_paid_apps,
             x='Category',
             y='Price',
             title='Price per Category')
box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))
box.show()